Multi-Level Cross-Modal Semantic Alignment Network for Video–Text Retrieval

Authors

Abstract

This paper strives to improve the performance of video–text retrieval. To date, many algorithms have been proposed to facilitate similarity measurement for retrieval, moving from a single global semantic level to multi-level semantics. However, these methods may suffer from the following limitations: (1) they largely ignore relationship semantics, so the modeled semantic levels are insufficient; (2) it is incomplete to constrain the real-valued features of different modalities to lie in the same space only through feature distance measurement; and (3) they fail to handle the problem that the distributions of attribute labels are heavily imbalanced. To overcome these limitations, this paper proposes a novel multi-level cross-modal semantic alignment network (MCSAN) for video–text retrieval that jointly models alignment at the global, entity, action, and relationship levels in a unified deep model. Specifically, both video and text are first decomposed into these four levels by carefully designed spatial–temporal semantic learning structures. Then, KLDivLoss and a parameter-shared projection layer are used as statistical constraints to ensure that the representations of the two modalities are projected into a common space. In addition, a focal binary cross-entropy (FBCE) loss function is presented in an effort to model the unbalanced distribution of attribute labels. MCSAN is practically effective at exploiting the complementary information among the four semantic levels. Extensive experiments on two challenging datasets, namely MSR-VTT and VATEX, show the viability of our method.
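For readers interested in the two concrete components named in the abstract, the sketch below shows, in PyTorch (implied by the mention of KLDivLoss), how such pieces are commonly realized: a parameter-shared projection whose outputs are tied together by a symmetric KLDivLoss constraint, and a focal variant of binary cross-entropy for imbalanced attribute labels. The function and class names, the symmetric form of the KL term, and the gamma/alpha values are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch, assuming standard realizations of the components named in
# the abstract; names and hyperparameters are illustrative, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F


def focal_binary_cross_entropy(logits, targets, gamma=2.0, alpha=0.25):
    """Focal binary cross-entropy over multi-label attribute scores.

    logits:  (batch, num_attributes) raw prediction scores
    targets: (batch, num_attributes) binary labels as floats in {0, 1}
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # Probability assigned to the ground-truth class of each attribute.
    p_t = p * targets + (1.0 - p) * (1.0 - targets)
    # alpha_t re-weights positives vs. negatives; (1 - p_t)^gamma down-weights
    # easy examples so rare attributes are not drowned out.
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()


class SharedProjection(nn.Module):
    """Project video and text features with the same (parameter-shared) layer
    and penalize the divergence of their normalized distributions."""

    def __init__(self, in_dim, common_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, common_dim)      # shared for both modalities
        self.kl = nn.KLDivLoss(reduction="batchmean")  # expects log-probs vs. probs

    def forward(self, video_feat, text_feat):
        v = self.proj(video_feat)
        t = self.proj(text_feat)
        v_log = F.log_softmax(v, dim=-1)
        t_log = F.log_softmax(t, dim=-1)
        # Symmetric KL between the two modalities' softmax distributions.
        kl_loss = 0.5 * (self.kl(v_log, t_log.exp()) + self.kl(t_log, v_log.exp()))
        return v, t, kl_loss


if __name__ == "__main__":
    proj = SharedProjection(in_dim=512, common_dim=256)
    video = torch.randn(8, 512)
    text = torch.randn(8, 512)
    _, _, kl = proj(video, text)
    attr_logits = torch.randn(8, 300)                  # e.g. 300 attribute labels
    attr_labels = torch.randint(0, 2, (8, 300)).float()
    fbce = focal_binary_cross_entropy(attr_logits, attr_labels)
    print(kl.item(), fbce.item())
```

In a full system these two terms would be combined with the multi-level retrieval losses; the specific weighting among the levels is a paper-specific detail not shown here.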


Similar articles

Multi modal multi-semantic image retrieval



MHTN: Modal-adversarial Hybrid Transfer Network for Cross-modal Retrieval

Cross-modal retrieval has drawn wide interest for retrieval across different modalities of data (such as text, image, video, audio and 3D model). However, existing methods based on deep neural network (DNN) often face the challenge of insufficient cross-modal training data, which limits the training effectiveness and easily leads to overfitting. Transfer learning is usually adopted for relievin...


Learning Deep Semantic Embeddings for Cross-Modal Retrieval

Deep learning methods have been actively researched for cross-modal retrieval, with the softmax cross-entropy loss commonly applied for supervised learning. However, the softmax cross-entropy loss is known to result in large intra-class variances, which is not very well suited for cross-modal matching. In this paper, a deep architecture called Deep Semantic Embedding (DSE) is proposed, which is ...


Correlation Hashing Network for Efficient Cross-Modal Retrieval

Due to the storage and retrieval efficiency, hashing has been widely deployed to approximate nearest neighbor search for large-scale multimedia retrieval. Cross-modal hashing, which improves the quality of hash coding by exploiting the semantic correlation across different modalities, has received increasing attention recently. For most existing cross-modal hashing methods, an object is first r...


Semiautomatic Image Retrieval Using the High Level Semantic Labels

Content-based image retrieval and text-based image retrieval are two fundamental approaches in the field of image retrieval. The challenges related to each of these approaches guide researchers toward combined approaches and semi-automatic retrieval using user interaction in the retrieval cycle. Hence, in this paper, an image retrieval system is introduced that provides two kinds of qu...



Journal

Journal title: Mathematics

Year: 2022

ISSN: 2227-7390

DOI: https://doi.org/10.3390/math10183346